This workshop is designed to give you a taste of what R can do. We’ll run through the process of importing, tidying, transforming, visualising and reporting data using the suite of packages that make up the tidyverse. These packages help lower the barrier to entry for new users and will hopefully inspire you to code regularly in R.
Learning outcomes
By the end of the workshop you’ll:
R is an open source programming language for statistical analysis and data visualisation. It was developed by Ross Ihaka and Robert Gentleman of the University of Auckland and released in 1995. There are now over 14,000 packages available for R which provide functions for machine learning, genomics, time series forecasting, and interactive graphics amongst many others.
R is widely used in academia and is rapidly replacing SPSS and other proprietary statistical software in undergraduate programmes. It is also used by well-known companies like Google, Netflix and Airbnb for data analytics. Many graphics published by news outlets like the Financial Times and the BBC are generated in R. The UK Government is also embracing R to help make their statistical reporting workflow more efficient and reproducible.
Since R is a language it is also:
and therefore supports the principles of open science.
There are also several R packages useful for those working in local government. For example, the fingertipsR package enables you to download population and public health indicators.
RStudio is an integrated development environment (IDE) for R. Its intuitive interface makes working with R much easier. It supports syntax highlighting and tab completion, and is integrated with R Markdown.
RStudio is freely available under the GNU Affero General Public License v3. A commercial desktop license is also available.
RStudio has four different panes:
The Console pane (bottom left in the default layout) is used to execute R commands immediately.
The Environment pane (top right) shows the datasets, models, and plots that are loaded in the current R session. This pane also contains tabs with a scrollable history of executed code, connections to databases and Git options.
The Files pane (bottom right) shows your files, plots and interactive web content, help documentation, and R packages that you can install and load.
The Source pane (top left) appears when you open a new file e.g. File -> New File -> R Script. Code can be saved in dedicated .R scripts and executed in the console with Ctrl-Enter/Cmd-Enter. Syntax highlighting and tab completion are also available.
The appearance of RStudio can be changed to suit you:
You can also change RStudio’s overall theme. Opting for a dark theme reduces the amount of glare that your eyes are subject to. Change the global theme to Dark by selecting ‘Appearance’ in the Global Options menu and opt for an Editor theme with a dark palette such as ‘Material’.
RStudio has many useful shortcuts that enable you to keep your hands on the keyboard thereby boosting your coding productivity. For example,
- Ctrl/Cmd + Enter: Run your selected code in the Console
- Ctrl/Cmd + Shift + M: Add piping operator
- Ctrl/Cmd + L: Clear the console window
- Ctrl/Cmd + Shift + R: Add a section break
- Shift + Ctrl/Cmd + 1: Make the console full screen
- Ctrl/Cmd + Shift + A: Format your code

NB The use of Ctrl or Cmd depends on whether you are using a Windows or a Mac device.
For a complete list of all available shortcuts just type: Alt + Shift + K
The dataset that we will be exploring derives from speedofanimals.com. These are measurements for speed, mass and length for a variety of animals living in the water, air and on land.
The dataset contains 5 variables:
- name: the common name of the animal
- habitat: whether the animal lives in water, air, or on land
- length_cm: the length of the animal in centimetres
- mass_kg: the mass of the animal in kilograms
- speed_kph: the top speed of the animal in kilometres per hour

You can download the dataset from here.
Adopting a consistent folder structure for your data analyses will help to ensure that your projects are reproducible. A project can be organised using a simple file structure like this:
project/
|
├── data/ # store your datasets
|
├── script.R # your R script
|
└── output/ # all your plots, models etc
Ideally, raw and intermediate datasets would be separated to ensure that you don’t accidentally overwrite your data.
project/
|
├── data/
│ ├── raw/ # read-only pre-processed datasets
│ └── processed/ # intermediate datasets
|
├── script.R # your R script
|
└── output/ # all your plots, models etc
Point your R session to your project folder using: Session > Set Working Directory > Choose Directory
NB It’s not good practice to set your working directory at the top of your R script because absolute paths don’t promote reproducibility.
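As a rough sketch of the folder structure and relative paths described above, the base R snippet below creates the folders inside a temporary directory and builds a path relative to the project root. The temporary location and the object names `project` and `csv_path` are assumptions made for illustration only.

```r
# Create the suggested folder structure inside a temporary directory
# (a stand-in for your real project folder)
project <- file.path(tempdir(), "project")
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "processed"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "output"), showWarnings = FALSE)

# List the folders that were created
list.dirs(project, full.names = FALSE)

# Build a path relative to the project root instead of hard-coding an absolute path
csv_path <- file.path("data", "raw", "speed_of_animals.csv")
csv_path
```

Because `csv_path` is relative, the same script should work on any machine once the working directory points at the project folder.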
Optional: Set up a project in RStudio
File > New File > R Script
You only need to install an R package once. Subsequent package updates can be handled by selecting Packages > Update in the Files pane of RStudio.
install.packages("tidyverse")
Packages need to be loaded at the start of every R session to give you access to the functions you need.
library(tidyverse)
R can handle a range of data formats: .csv, .xlsx, .sav, .dta, .sas, .xml, .json, .geojson etc. Some data formats require specific packages.
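For instance, JSON can be read with the jsonlite package. The snippet below is a minimal sketch: the choice of jsonlite is an assumption (the workshop does not name a package), and the two records are copied from the first rows of the dataset shown in the glimpse() output. Excel and SPSS/Stata files are commonly read with readxl::read_excel() and haven::read_sav() / haven::read_dta() respectively.

```r
library(jsonlite)

# A small JSON array holding two records from the dataset
json_text <- '[
  {"name": "African Bush Elephant", "habitat": "Land", "speed_kph": 40.0},
  {"name": "African Wild Dog", "habitat": "Land", "speed_kph": 72.5}
]'

# fromJSON() simplifies an array of records into a data frame
animals_json <- fromJSON(json_text)
str(animals_json)
```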
We can load a CSV file using the read_csv function with an argument which provides the file path to the dataset. A new object called speedofanimals is created which stores the data that we have loaded in the R session.
speedofanimals <- read_csv("data/speed_of_animals.csv")
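If you don't have the dataset to hand, the following self-contained sketch writes a two-row CSV file to a temporary location and reads it back with read_csv(). The temporary file and the object name `animals` are assumptions for illustration; the values are the first two rows of the dataset.

```r
library(readr)

# Write a tiny CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,habitat,length_cm,mass_kg,speed_kph",
  "African Bush Elephant,Land,690,8000,40",
  "African Wild Dog,Land,100,30,72.5"
), csv_path)

# read_csv() guesses the type of each column and returns a tibble
animals <- read_csv(csv_path)
animals
```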
The glimpse() function prints all the variables, their data types and the first few values.
glimpse(speedofanimals)
## Observations: 64
## Variables: 5
## $ name <chr> "African Bush Elephant", "African Wild Dog", "Black Ma…
## $ habitat <chr> "Land", "Land", "Land", "Land", "Land", "Land", "Land"…
## $ length_cm <dbl> 690, 100, 340, 250, 140, 90, 50, 28, 140, 120, 3, 520,…
## $ mass_kg <dbl> 8000.00, 30.00, 1.40, 620.00, 65.00, 18.00, 6.00, 0.13…
## $ speed_kph <dbl> 40.0, 72.5, 32.2, 35.0, 120.0, 69.0, 48.0, 20.0, 72.4,…
speedofanimals <- read_csv("data/speed_of_animals.csv",
col_types = cols(
name = col_factor(NULL),
habitat = col_factor(NULL),
length_cm = col_integer(),
mass_kg = col_double(),
speed_kph = col_double()
))
speedofanimals
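Tidy data are structured for use in R and satisfy three rules:
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell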
gather()
Convert variable names into values and structure the dataframe into a more machine readable format (wide to long)
speedofanimals_long <- gather(speedofanimals, measure, value, -name, -habitat)
speedofanimals_long
spread()
Convert dataframe into a more tabular, human readable format (long to wide)
spread(speedofanimals_long, measure, value)
Note that pivot_longer() and pivot_wider() are the more intuitively named successors to gather() and spread(), available from tidyr 1.0.0.
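A minimal sketch of the same reshaping with pivot_longer() and pivot_wider(), which tidyr provides from version 1.0.0. It uses the first two rows of the dataset; the object names `animals`, `animals_long` and `animals_wide` are assumptions for illustration.

```r
library(tidyr)

# First two rows of the dataset
animals <- tibble::tribble(
  ~name,                   ~habitat, ~length_cm, ~mass_kg, ~speed_kph,
  "African Bush Elephant", "Land",   690,        8000,     40.0,
  "African Wild Dog",      "Land",   100,        30,       72.5
)

# Wide to long: the equivalent of gather()
animals_long <- pivot_longer(
  animals,
  cols = c(length_cm, mass_kg, speed_kph),
  names_to = "measure",
  values_to = "value"
)
animals_long

# Long to wide: the equivalent of spread()
animals_wide <- pivot_wider(animals_long, names_from = measure, values_from = value)
animals_wide
```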
separate()
Split out values stored in one variable into multiple variables
split <- separate(speedofanimals_long, measure, c("measure", "unit"), sep = "_")
split
unite()
Join values stored in multiple variables into one
unite(split, measure, c(measure, unit), sep = "_")
select()
Pick and rename columns in a dataframe.
select(speedofanimals, name, speed_kph)
select(speedofanimals, -speed_kph)
select(speedofanimals, 1:3)
select(speedofanimals, starts_with("hab"))
select(speedofanimals, animal = name)
filter()
Subset rows in a dataframe by an expression.
filter(speedofanimals, length_cm < 100)
filter(speedofanimals, name == "Roadrunner")
filter(speedofanimals, habitat != "Air")
filter(speedofanimals, grepl("African", name))
filter(speedofanimals, mass_kg > 1000 & habitat == "Land")
filter(speedofanimals, name %in% c("Coyote", "Roadrunner"))
arrange()
Change the order of rows in a dataframe.
arrange(speedofanimals, speed_kph)
arrange(speedofanimals, desc(speed_kph))
mutate()
Create new variables with functions of existing columns in a dataframe.
mutate(speedofanimals, speed_mph = speed_kph * 0.62137)
mutate(speedofanimals, name = factor(name))
mutate(speedofanimals, speed_mph = speed_kph * 0.62137,
name = factor(name))
%>%
The piping operator allows you to combine multiple operations together in a chain. For example,
speedofanimals %>%
filter(habitat == "Air") %>%
mutate(speed_mph = speed_kph * 0.62137) %>%
select(name, speed_mph) %>%
arrange(desc(speed_mph))
The pipe obviates the need to create objects for each intermediate transformation of your dataframe and makes your code more human readable. For example, the code above can be translated as:
Filter rows in the speedofanimals dataframe whose values in the habitat column are equal to “Air”, then create a new column which calculates the speed in miles per hour of each animal, then select the name and speed_mph columns, and then sort the resulting rows by the values of speed_mph in descending order.
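To see what the pipe saves you, here is a sketch of the same kind of chain written both ways on three rows copied from the glimpse() output (the truncated third name is assumed to be Black Mamba). Since these rows are all land animals, the filter uses "Land" rather than "Air". The names `step1` to `step3`, `without_pipe` and `with_pipe` are hypothetical.

```r
library(dplyr)

# Three rows from the dataset
animals <- tibble::tribble(
  ~name,                   ~habitat, ~speed_kph,
  "African Bush Elephant", "Land",   40.0,
  "African Wild Dog",      "Land",   72.5,
  "Black Mamba",           "Land",   32.2
)

# Without the pipe: one intermediate object per transformation
step1 <- filter(animals, habitat == "Land")
step2 <- mutate(step1, speed_mph = speed_kph * 0.62137)
step3 <- select(step2, name, speed_mph)
without_pipe <- arrange(step3, desc(speed_mph))

# With the pipe: the same steps chained together
with_pipe <- animals %>%
  filter(habitat == "Land") %>%
  mutate(speed_mph = speed_kph * 0.62137) %>%
  select(name, speed_mph) %>%
  arrange(desc(speed_mph))

# Both approaches should give the same result
all.equal(without_pipe, with_pipe)
```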
The piping operator makes it easy to apply functions to dataframes grouped by specific variables. For example, the code below uses the group_by() and summarise() functions to calculate the mean speed of animals by habitat.
speedofanimals %>%
mutate(speed_mph = speed_kph * 0.62137) %>%
group_by(habitat) %>%
summarise(mean_speed_mph = round(mean(speed_mph), 1)) %>%
arrange(desc(mean_speed_mph))
“A statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars).”
Hadley Wickham (2016)
| Variable | Geometry | Aesthetic |
|---|---|---|
| mass_kg | point | x-position |
| speed_kph | point | y-position |
| length_cm | point | size |
| habitat | point | fill |
Create a ggplot() object
ggplot()
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph))
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph)) +
geom_point()
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph)) +
geom_point(colour = "tomato")
Map the habitat variable to fill and add a border to the points
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph, fill = habitat)) +
geom_point(shape = 21, colour = "black", alpha = 0.6)
library(scales)
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph, fill = habitat)) +
geom_point(shape = 21, colour = "black", alpha = 0.6) +
scale_x_continuous(trans = log_trans(), breaks=c(1,10,100,10000)) +
scale_y_continuous(trans = log_trans(), breaks=c(1,10,100))
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph, fill = habitat)) +
geom_point(shape = 21, colour = "black", alpha = 0.6) +
scale_x_continuous(trans = log_trans(), breaks=c(1,10,100,10000)) +
scale_y_continuous(trans = log_trans(), breaks=c(1,10,100)) +
labs(title = "Relationship between body mass and speed \namongst selected animals",
caption = "Source: speedofanimals.com",
x = "Body mass (kg, log)",
y = "Top speed (kph, log)",
fill = "Habitat")
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph, fill = habitat)) +
geom_point(shape = 21, colour = "black", alpha = 0.6) +
scale_x_continuous(trans = log_trans(), breaks=c(1,10,100,10000)) +
scale_y_continuous(trans = log_trans(), breaks=c(1,10,100)) +
labs(title = "Relationship between body mass and speed \namongst selected animals",
caption = "Source: speedofanimals.com",
x = "Body mass (kg, log)",
y = "Top speed (kph, log)",
fill = "Habitat") +
theme_minimal()
Map length_cm to size and add some further tweaks
ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph, fill = habitat, size = length_cm)) +
geom_point(shape = 21, colour = "black", alpha = 0.6) +
scale_fill_brewer(palette = "Set2") +
scale_size_continuous(range = c(1, 10)) +
scale_x_continuous(trans = log_trans(), breaks=c(1,10,100,10000)) +
scale_y_continuous(trans = log_trans(), breaks=c(1,10,100)) +
labs(title = "Relationship between body mass and speed \namongst selected animals",
caption = "Source: speedofanimals.com",
x = "Body mass (kg, log)",
y = "Top speed (kph, log)",
fill = "Habitat",
size = "Length (cm)") +
theme_minimal() +
theme(
plot.margin = unit(rep(30, 4), "pt"),
panel.grid.minor = element_blank(),
axis.title = element_text(size = 8, face = "plain", hjust = 1),
plot.caption = element_text(size = 8, hjust = 1, margin = margin(t = 15)))
ggsave("scatterplot.png", scale=1, dpi=300)
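ggsave() saves the last plot displayed unless you pass one explicitly. The sketch below passes the plot and its dimensions as arguments and writes into an output folder, in keeping with the project structure suggested earlier. The built-in mtcars data, the temporary folder and the 6 x 4 inch size are stand-ins chosen for illustration.

```r
library(ggplot2)

# A stand-in plot using the built-in mtcars dataset
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Create an output folder (here inside a temporary directory)
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE)

# Save the plot with explicit dimensions and resolution
png_path <- file.path(out_dir, "scatterplot.png")
ggsave(png_path, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```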
library(ggiraph)
speedofanimals$name <- str_replace_all(speedofanimals$name, "'", "")
p <- ggplot(data = speedofanimals,
aes(x = mass_kg, y = speed_kph, fill = habitat, size = length_cm)) +
geom_point_interactive(aes(tooltip = name), shape = 21, colour = "black", alpha = 0.6) +
scale_fill_brewer(palette = "Set2") +
scale_size_continuous(range = c(1, 10)) +
scale_x_continuous(trans = log_trans(), breaks=c(1,10,100,10000)) +
scale_y_continuous(trans = log_trans(), breaks=c(1,10,100)) +
labs(title = "Relationship between body mass and speed \namongst selected animals",
caption = "Source: speedofanimals.com",
x = "Body mass (kg, log)",
y = "Top speed (kph, log)",
fill = "Habitat",
size = "Length (cm)") +
theme_minimal() +
theme(
plot.margin = unit(rep(30, 4), "pt"),
panel.grid.minor = element_blank(),
axis.title = element_text(size = 8, face = "plain", hjust = 1),
plot.caption = element_text(size = 8, hjust = 1, margin = margin(t = 15)))
girafe(code = print(p))
speedofanimals <- speedofanimals %>%
mutate(body_lengths_s_km = ((speed_kph*1000) / 3600) / (length_cm / 100))
select(speedofanimals, name, body_lengths_s_km) %>%
arrange(desc(body_lengths_s_km))
speedofanimals <- speedofanimals %>%
mutate(rel_speed_kph = ((body_lengths_s_km * 1.83) * 3600) / 1000)
select(speedofanimals, name, body_lengths_s_km, rel_speed_kph) %>%
arrange(desc(rel_speed_kph))
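The two mutate() steps above can be checked by hand. This base R sketch repeats the arithmetic for the first two animals in the dataset. The constant 1.83 is taken from the code above and is assumed here to be a reference human height in metres; the workshop text does not say so explicitly.

```r
# Speed and length of the first two animals in the dataset
speed_kph <- c(elephant = 40, wild_dog = 72.5)
length_cm <- c(elephant = 690, wild_dog = 100)

# Convert to metres per second and metres
speed_m_s <- speed_kph * 1000 / 3600
length_m <- length_cm / 100

# Body lengths travelled per second
body_lengths_s <- speed_m_s / length_m

# Equivalent speed in kph for a body 1.83 m long (assumed human height)
rel_speed_kph <- body_lengths_s * 1.83 * 3600 / 1000

round(body_lengths_s, 2)
round(rel_speed_kph, 1)
```

The unit conversions cancel, so the relative speed reduces to `speed_kph * 1.83 / length_m`.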
The rmarkdown package allows you to write dynamic reports in R. These are plain text files (.Rmd) written in markdown containing chunks of embedded R code.
Markdown is a simple syntax that allows you to add formatting to plain text. By enclosing text with asterisks you can add *emphasis* or write inline `code` with backticks.
Use asterisks and tildes to add emphasis to your text.
**bold** → bold
*italics* → italics
***bold italics*** → bold italics
~~strikethrough~~ → strikethrough
End a line with two or more spaces to create a line break. You can also use HTML break <br /> tags.
To create a header begin a line with a hashtag. Each additional hashtag makes the header smaller.
# 1st level heading
## 2nd level heading
### 3rd level heading
#### 4th level heading
##### 5th level heading
Place the link text in square brackets and the URL path in parentheses.
Download and install the latest version of R [here](https://www.r-project.org).
Download and install the latest version of R here.
Place (optional) explanatory text in square brackets and the image URL or path in parentheses, preceded by an exclamation mark.
![A polar bear](img/polar_bear.jpg)
Image: wikipedia.org
To resize your image just use HTML:
<figure>
<img src="img/polar_bear.jpg" alt="A polar bear" width="30%">
<figcaption>Image: wikipedia.org</figcaption>
</figure>
Inline code
Wrap code in single backticks. To have R evaluate the code when the document is knitted, start it with the letter r.
Today’s date is `r format(Sys.time(), '%d %B, %Y')`
Today’s date is 08 May, 2019
Code blocks
Place 3 backticks on a line above and below the code block.
```r
ggplot(data = speedofanimals, aes(x = mass_kg, y = speed_kph)) +
geom_point()
```
ggplot(data = speedofanimals, aes(x = mass_kg, y = speed_kph)) +
geom_point()
To create a blockquote, start the line with a greater-than symbol.
> There are no emergency meetings, no headlines, no breaking news. No one is acting as if we were in a crisis. Even most climate scientists or green politicians keep on flying around the world, eating meat and dairy. […] Today we use 100 million barrels of oil every single day. There are no politics to change that. There are no rules to keep that oil in the ground. So we can't save the world by playing by the rules. Because the rules have to be changed. Everything needs to change. And it has to start today.
> (Greta Thunberg)
There are no emergency meetings, no headlines, no breaking news. No one is acting as if we were in a crisis. Even most climate scientists or green politicians keep on flying around the world, eating meat and dairy. […] Today we use 100 million barrels of oil every single day. There are no politics to change that. There are no rules to keep that oil in the ground. So we can’t save the world by playing by the rules. Because the rules have to be changed. Everything needs to change. And it has to start today. (Greta Thunberg)
Bullet points
Precede each line in a list with a single asterisk, hyphen or plus sign.
- Land
- Air
- Water
Numbered lists
Precede each line in a list with a number and full stop.
1. Land
2. Air
3. Water
To prevent text starting with a number being formatted as a numbered list, add a backslash before the full stop.
1\. Land
1. Land
Nested lists
Indent each item in the sublist by four spaces.
1. Land:
    - African Bush Elephant
    - African Wild Dog
    - Black Mamba
    - Brown Bear
    - Cheetah
    - Coyote
    - Domestic Cat
    - Eastern Gray Squirrel
Separate the columns with pipes and place a row of hyphens beneath the column names to divide them from the rows below. Colons are used to align the values in the cells.
| |Column 1 |Column 2 |Column 3 |
|:---- |:------- |:-------:| --------:|
|Row 1 |is |is |is |
|Row 2 |left |centre |right |
|Row 3 |aligned |aligned |aligned |
| | Column 1 | Column 2 | Column 3 |
|:---|:---|:---:|---:|
| Row 1 | is | is | is |
| Row 2 | left | centre | right |
| Row 3 | aligned | aligned | aligned |
The workflow for creating a dynamic document in R is:
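One way to sketch this workflow in code is to write a small .Rmd file and knit it with rmarkdown::render(). The file contents below are hypothetical, and the render step is guarded because it needs both the rmarkdown package and pandoc to be installed. In RStudio the same result is usually reached by clicking the Knit button.

```r
# The code fence is built with strrep() so this example stays readable
fence <- strrep("`", 3)

# Write a minimal R Markdown file to a temporary location
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Speed of animals\"",
  "output: html_document",
  "---",
  "",
  "Today's date is `r format(Sys.Date(), '%d %B, %Y')`.",
  "",
  paste0(fence, "{r}"),
  "summary(cars$speed)",
  fence
), rmd_path)

# Knit the file to HTML if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
}
```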
The following list of online books, MOOCs and blog posts should help you to explore R in more depth:
Beginner
- RStudio primers
- R for Data Science
- Stat545
- Hands-On Programming with R
- swirl
- Johns Hopkins Data Science Specialization on Coursera
- Data Carpentry lessons
- RStudio Webinars
Intermediate and advanced
- What They Forgot to Teach You About R
- Advanced R
Statistics
- Discovering Statistics Using R
- Statistics: An Introduction Using R
Data visualisation
- Fundamentals of Data Visualization
- Data Visualization: A practical introduction
- R Graphics Cookbook
- BBC Visual and Data Journalism cookbook for R graphics
- ggplot2 graphics companion
Style guides
- The tidyverse style guide
- Google’s R Style Guide
R user groups
- Manchester R
- London R
- R-Ladies
Help
- StackOverflow
- RStudio Community
- Twitter #rstats hashtag